High validity real-world evidence study with deep phenotyping

ABSTRACT

Systems and methods are described for implementing an advanced, “research-grade” or “regulatory-grade,” real-world evidence (RWE) approach. The advanced RWE is able to extract a deep phenotype from rich data sources using advanced technologies including artificial intelligence. The rich data sources include both unstructured data and structured data from electronic health records and may include additional data sources such as claims or registries. Systems and methods are also described for validating the deep phenotype which can then be used to create a patient cohort that may be linked to exposure or outcome data to make credible clinical assertions.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 17/581,403, filed Jan. 21, 2022, which claims the benefit under 35 U.S.C. § 119(e) of U.S. Application No. 63/142,432, filed Jan. 27, 2021. The entire contents of all of the above-identified applications are incorporated herein by reference.

BACKGROUND

There is a national desire in the United States to implement real-world evidence (RWE) within regulatory, reimbursement, and clinical pathways as a step toward personalized medicine, improved care, and more efficient care. This will accelerate use of routinely collected data to refine care. By influencing what is approved, reimbursed, and selected for care, RWE will adjust the standard of care.

Adjusting the standard of care, however, can have unintended and dangerous consequences. Inaccurate data allowed into a patient's electronic health record (EHR) has the potential to hurt one patient. Inaccurate data allowed into regulatory or reimbursement pathways can harm an entire nation.

RWE is often used to support trial recruitment, trial design, and marketing insight. As it is increasingly used to make clinical assertions, there is reason to believe that current approaches may benefit from greater rigor. Claims data often have accuracy below 50% (see, e.g., Jollis et al., Ann Intern Med. 1993; 119(8):844-50 and Lawson et al., Ann Surg. 2012; 256(6):973-81). Likewise, EHR problem lists often have accuracy below 60% (see, e.g., Luna et al., Stud Health Technol Inform. 2013; 192:417-21, Wright et al., Int J Med Inform. 2015; 84(10):784-90, Onofrei et al., Inform Prim Care. 2004; 12(3):139-45, and Parsons et al., J Am Med Inform Assoc. 2012; 19(4):604-9). In particular, the low accuracy can be attributed to low sensitivity. Low sensitivity may incorporate bias since sicker patients with more touchpoints in the health system have more complete documentation. Incomplete data that lead to biased patient selection in a study can lead to incorrect clinical assertions.

There is, therefore, a need for new and more rigorous approaches to RWE. Further, such a need is urgent as regulators, payers, and providers are increasingly incorporating RWE insights into their decision-making processes.

SUMMARY

The present technology provides innovations in phenotyping, reference standards, accuracy measurement, and enhanced privacy and security. With such innovations, a deep phenotype can be extracted from rich data sources. The deep phenotype can be validated and then used to create a patient cohort that may be linked to exposure or outcome data to make credible clinical assertions. Built upon such a deep phenotype, the advanced RWE technology of the present disclosure can be recognized as “research-grade” or “regulatory-grade.”

In accordance with one embodiment of the present disclosure, provided is a method for defining a real-world evidence (RWE)-based cohort, comprising extracting, from unstructured data (optionally along with structured data) in an electronic health record (EHR), e.g., using a semantic processing technique, a plurality of clinical concepts (optionally with concept attributes) associated with patients; mapping each of the extracted clinical concepts to a coded clinical concept; and comparing the mapped clinical concepts to inclusion or exclusion criteria to define a cohort of the patients within the EHR that satisfy a desired phenotype.

In some embodiments, the methods may further comprise generating a RWE-based registry for the cohort that comprises phenotypes associated with at least a subset of the cohort. In some embodiments, the methods may further comprise conducting a trial recruitment of at least a subset of the cohort to enroll in a consented study. In some embodiments, the methods may further comprise conducting a trial recruitment of at least a subset of the cohort to enroll in a randomized controlled trial (RCT). In some embodiments, the methods may further comprise conducting a trial recruitment of at least a subset of the cohort to enroll in a pragmatic control trial (PCT).

Another embodiment provides a method for defining a real-world evidence (RWE)-based cohort, comprising extracting, from unstructured data (optionally along with structured data) in an electronic health record (EHR), e.g., using a semantic processing technique, a plurality of clinical concepts (optionally with concept attributes) associated with patients; mapping each of the extracted clinical concepts to a coded clinical concept; and comparing the mapped clinical concepts to an analytic model to determine a cohort risk profile. In some embodiments, the method may further comprise using the cohort risk profile within value-based contracting.

Yet another embodiment provides a method for conducting a real-world evidence (RWE) study, comprising extracting, from unstructured data in an electronic health record (EHR), a plurality of clinical concepts relating to phenotypes; identifying a cohort of patients from the EHR with a patient phenotype that satisfies at least a portion of a criteria of a study phenotype; obtaining, for the cohort, exposure data and outcome data relating to at least a portion of the patients within the identified cohort; and implementing a RWE study based on the patient phenotype associated with the exposure data or the outcome data for at least one of the patients.

Still a further embodiment provides a method for generating a real-world evidence (RWE)-based cohort having measured data accuracy, comprising extracting, from unstructured data in an electronic health record (EHR), using a semantic processing technique, a plurality of clinical concepts; mapping each of the extracted clinical concepts to a coded clinical concept; comparing the mapped clinical concepts and concept attributes to inclusion or exclusion criteria to define a cohort of the patients within the EHR that satisfy a desired phenotype; creating a generated gold standard for a portion of the clinical concepts within a portion of the patients within the cohort; and measuring an accuracy of the semantic processing extraction of the clinical concepts for the cohort to determine validity of the cohort with respect to the generated gold standard for a subset of the cohort, based on at least a portion of the inclusion or exclusion criteria.

In some embodiments, the methods may further comprise associating at least a subset of the clinical concepts and concept attributes with a desired phenotype, wherein the desired phenotype satisfies a threshold phenotypic similarity to a phenotype in a randomized controlled trial. In some embodiments, the methods may further comprise associating at least a subset of the clinical concepts and concept attributes with a desired phenotype, wherein the desired phenotype satisfies a threshold phenotypic similarity to a phenotype in an existing or anticipated regulatory-approved label.

In some embodiments, the implementing of the RWE study comprises conducting, based on the cohort, an observational study. In some embodiments, the implementing of the RWE study comprises comparing outcomes from the outcome data of the cohort with outcomes from an interventional study so that the cohort functions as a synthetic control arm. In some embodiments, the implementing of the RWE study comprises comparing outcomes of the cohort with outcomes from another cohort or another study to determine comparative effectiveness of at least two treatments. In some embodiments, the implementing of the RWE study comprises implementing the association of the patient phenotype with the exposure data or the outcome data through data linkage with another data set. In some embodiments, the another data set comprises a claims data set or a registry.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates a high validity RWE of the present disclosure that incorporates a “deep” phenotype.

FIG. 2 illustrates an example process of implementing an advanced RWE.

FIG. 3 illustrates an example process of validating a phenotype.

FIG. 4 is a schematic illustrating the computing components that may be used to implement various features of the embodiments described in the present disclosure.

DETAILED DESCRIPTION

The present disclosure provides an improved approach to RWE, at least in part rooted in computer technology, that overcomes the previously discussed problems and achieves accuracy sufficient for credible clinical assertions, also known as “research-grade” or “regulatory-grade” RWE.

Without limitation, the improved RWE is built upon the following innovations. First, a novel phenotyping approach is provided which combines structured data with unstructured data enriched using artificial intelligence in order to achieve a deep phenotype. Second, the deep phenotype may be used to create a patient cohort which may be linked to exposure or outcome data either within an electronic health record (EHR) or from another data source. Third, a new procedure for assessing data validity is described, which measures data accuracy against accuracy requirements defined within a study protocol. Fourth, integral to the new data validity assessment method is a chart abstraction reference standard with minimum required inter-rater reliability.

I. Real World Evidence (RWE)

United States healthcare is embroiled in a crisis of inconsistent quality and overwhelming cost. Through three administrations, the national healthcare strategy has focused on using data and technology to control costs and improve care.

In the first decade of healthcare data reform, between 2010 and 2020, there was uptake of electronic health records (EHRs) and implementation of population health programs. These achieved more consistent application of care pathways, though they did not change the standard of care.

In the second decade of healthcare data reform, from 2020 to 2030, there is a desire to use routinely collected data to enhance the standard of care. While the randomized controlled trial (RCT) remains the bedrock of clinical science, due to expense and difficult implementation, RCTs cannot provide answers to all questions. Given a set of safe and effective options to treat a condition, a medical professional will often select between treatments based on personal experience rather than data. The desire for better data to select a therapy based on individual patient characteristics is not new. Over the years, it has been called subgroup analytics, comparative effectiveness, tailored therapy, personalized medicine, and precision medicine. The latest name, as advocated by the 21st Century Cures Act, is RWE.

The term “real-world evidence” or “RWE” denotes using routinely collected data to provide evidence of clinical outcomes. The collected data may be observational data obtained outside the context of randomized controlled trials (RCTs). The data may be stored in or derived from EHRs, medical claims or billing activities databases, registries, patient-generated data, mobile devices, etc.

The EHR as a Data Source for RWE

If the goal is to use data collected during routine care to enable tailored therapy, the EHR must be used. The EHR holds the majority of clinical data collected during routine clinical care.

After a patient visit, the physician writes a detailed narrative of the encounter, may or may not add an item to the structured record such as the problem list, and submits a claim to insurance to be paid. All of this information may be included in the patient's EHR.

The EHR typically includes both structured data and unstructured data. Structured data includes EHR problem, medication, lab, and other coded lists. Unstructured data typically constitutes the majority of EHR content, including physician notes, study reports, and other provider notes such as those from nurses and social workers.

Structured data lists represent a small fraction of data used in routine care and are not intended for clinical studies. It is therefore not surprising that they can result in incorrect clinical assertions. For instance, multiple studies have shown that problem lists are inaccurate and lack granularity. When a doctor records a problem, it is typically true, meaning that precision is close to 100%. However, if a problem exists, the doctor may or may not record it in the problem list, meaning recall (also referred to as “sensitivity,” which references whether a condition is documented if a patient has it) may be low in structured data. Multiple peer-reviewed studies have shown that in the problem list, sensitivity often falls below 50% in describing the most important medical conditions, such as cancer and heart attack.

The missingness results in bias. As an example, a study on heart attack treatments demonstrating a 20% difference between study arms would be poorly served by a data set missing 70% of heart attacks and preferentially identifying only the most severe cases (as commonly occurs in problem lists). Increasingly, therefore, there is a desire to go to the heart of clinical documentation, the unstructured narrative used in everyday care which contains 80% of relevant clinical content. But, the unstructured data is recorded as natural language text and is difficult to process, which is one reason that structured data is more commonly used in RWE.

Patient Selection to Create a Study Cohort

Patient selection in RWE studies typically involves defining inclusion and exclusion criteria and identifying patients in a registry or EHR that meet these criteria. The term “phenotype” refers to a set of patient characteristics including problems, findings, meds, labs, and procedures. The phenotype of a patient may be compared against inclusion and exclusion criteria for a study to determine whether the patient is included in or excluded from that study.
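The comparison of a phenotype against inclusion and exclusion criteria can be sketched as a set test. The following is a minimal illustration only, assuming each phenotype has already been reduced to a set of coded concepts; the `Patient` class, the `select_cohort` function, and the sample patients are hypothetical and are not part of the disclosure.

```python
from dataclasses import dataclass, field


@dataclass
class Patient:
    patient_id: str
    # Coded clinical concepts making up the patient's phenotype (hypothetical).
    concepts: set = field(default_factory=set)


def select_cohort(patients, inclusion, exclusion):
    """Return IDs of patients having every inclusion concept and no exclusion concept."""
    cohort = []
    for patient in patients:
        meets_inclusion = inclusion <= patient.concepts
        hits_exclusion = bool(exclusion & patient.concepts)
        if meets_inclusion and not hits_exclusion:
            cohort.append(patient.patient_id)
    return cohort


patients = [
    Patient("p1", {"diabetes mellitus", "hypertension"}),
    Patient("p2", {"myocardial infarction"}),
    Patient("p3", {"diabetes mellitus", "hypertension", "chronic kidney disease"}),
]

cohort = select_cohort(
    patients,
    inclusion={"diabetes mellitus", "hypertension"},
    exclusion={"chronic kidney disease"},
)
print(cohort)
```

In this sketch a missing concept excludes a patient exactly as a truly absent condition would, which is why low recall in the underlying phenotype translates directly into a biased cohort.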

As illustrated in FIG. 1, the phenotype represents the foundation of all studies. If the wrong patients are selected for a study, there is little value in assessing exposures and outcomes. For example, if a study is assessing outcomes of patients with diabetes and hypertension but the cohort selected is patients with heart attack and no diabetes, then knowing the outcomes of that cohort is not helpful in determining the outcomes for patients with diabetes and hypertension.

Therefore, before an RWE study can be trusted to influence the standard of care, it should have an acceptable approach to patient selection and proven data validity to ensure the right patients are selected. Required components include data availability, accuracy, and validation. Data availability references whether required data elements exist in the data set. Data accuracy references whether the data element, if it exists, is consistently captured. Data accuracy encompasses recall and precision. Data validation references whether accuracy is measured. This is usually performed by comparing a subset of data against manual chart abstraction acting as a gold standard.

As an example of data availability, it is not possible to study disease-free survival in a claims data set because that outcome is not captured in claims data. In terms of accuracy, it is inadvisable to study smoking in an EHR structured data set because the doctor adds smoking to the problem list for less than 10% of smokers. In terms of validation, it is not safe to study most subgroups in claims or EHR data without validation since these data sets are known to have low recall for granular concepts such as stage 3B colon cancer and systolic heart failure. If half the relevant data in a study are missing, systematic bias can easily lead to erroneous study results.

Data validity, therefore, is a necessary precaution to allow RWE to influence the standard of care. However, current approaches to data validity are insufficiently rigorous to protect patients.

II. Traditional RWE

Patient phenotype is often extracted from claims and other types of EHR structured data. As explained above, recall is often low in such structured data. Key variables may also be unavailable; for example, cardiac ejection fraction is not recorded in such data sets. More recently, there has been increasing exploration of unstructured records which contain key variables with high accuracy. This led to a proliferation of companies providing RWE in oncology. The approaches undertaken by such companies, however, are flawed.

Traditional RWE is Inherently Inaccurate and Non-Scalable

The RWE approaches employed by modern RWE companies are manual. Each of them has hired an army of annotators to review and annotate unstructured data. They typically use a structured data query to find thousands of patients with a disease out of millions of records. Annotators are then asked to review the thousands of records returned by the query, but not the millions of records it did not return.

First, such an approach is limited by the structured data query which is inflexible by definition, and cannot learn from the millions of records.

Second, accuracy is difficult or impossible to measure in traditional RWE. Accuracy is measured as recall and precision. Recall is the number of patients correctly identified divided by the total number of patients who truly belong in each cohort. Precision is the number of patients correctly identified divided by the total number of patients identified in each cohort. The F1-score, one measure of accuracy, is the harmonic mean of the precision and recall. In traditional RWE, even though the annotators may confirm that a patient identified using structured data meets criteria (precision), they never see patients that were missed out of millions of records where the structured data element was not included (recall). Therefore, such a traditional RWE approach allows companies to measure precision but is inherently incapable of testing recall. Yet, recall is where most of the error and bias exist.
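The relationship between these measures can be illustrated with a short calculation. The sketch below assumes the set of truly eligible patients is known from a reference standard; the function name and the ten-patient example are hypothetical. It shows how a query can score perfect precision while missing most eligible patients.

```python
def precision_recall_f1(identified, truly_eligible):
    """Compute precision, recall, and F1 for a cohort against a reference standard."""
    true_positives = len(identified & truly_eligible)
    precision = true_positives / len(identified) if identified else 0.0
    recall = true_positives / len(truly_eligible) if truly_eligible else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: the structured query finds 3 patients, all correct,
# but 10 patients in the full population truly meet the criteria.
identified = {"p1", "p2", "p3"}
truly_eligible = {f"p{i}" for i in range(1, 11)}

precision, recall, f1 = precision_recall_f1(identified, truly_eligible)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

An annotator who reviews only the three identified records can confirm the precision but has no way to observe the seven missed patients, so recall stays unmeasured.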

Third, such a manual approach is highly expensive and not scalable. Within oncology, even though the approach is expensive, high-priced medications make such a high expense acceptable for some studies. However, only the pharmaceutical industry can afford these expensive studies and only for high priority oncology diseases. For many diseases outside of oncology, the level of manual annotation required for large population sizes is entirely infeasible.

III. Advanced RWE

Advanced RWE is More Believable than Traditional RWE

Advanced RWE is defined as high accuracy RWE sufficient to make a clinical assertion. A clinical assertion is a declaration that something has happened or will happen. For example, a clinical assertion may state that one drug is more effective than another for a subgroup of patients. As another example, a clinical assertion may state that patients with a disease have a specific rate of a specific outcome.

Traditional RWE does not measure accuracy and may be dangerous in making clinical assertions. As a past example, traditional RWE asserted that hormone replacement therapy was useful for post-menopausal women. This led to more than a decade of inappropriate treatment of women with drugs that provided no benefit and caused breast cancer. Inaccurate RWE can lead to incorrect clinical assertions that change behavior of doctors, insurance, and regulators. This behavior, also known as the standard of care, can be changed incorrectly and lead to wrong therapy and patient harm as occurred with hormone replacement therapy.

Advanced RWE assertions are more believable than traditional RWE assertions because advanced RWE is more accurate. The most common source of inaccuracy in RWE is selecting the wrong patient cohort. In RWE, the patient cohort is selected by matching patient phenotype against inclusion and exclusion criteria for a study. A patient phenotype is a set of characteristics, diseases, symptoms, signs, procedures, medications, laboratory studies, and other clinical and non-clinical information of a patient. If the patient phenotype is inaccurate, the patient may be inappropriately included in or excluded from the study. This can make the clinical assertions wrong. For example, if a study is designed to test whether a drug is useful in diabetes, and an RWE process is used that selects patients who do not have diabetes, it will not matter how accurately the outcomes are measured. That study, if improperly applied to diabetics, would assert that the drug is useful in diabetes where it may actually be harmful in diabetics. This would lead to inappropriate treatment and patient harm.

To solve these and other challenges, advanced RWE uses a deep phenotype and data linkage.

A deep phenotype is a highly accurate phenotype. For example, if EHR structured data was 50% accurate in correctly identifying whether a patient has cancer but EHR unstructured data and data enrichment was 90% accurate, the deep phenotype would require using EHR unstructured data and data enrichment.

Data linkage is connecting multiple data sources for the same patient. For example, for a given patient, the phenotype from EHR unstructured data may be linked to exposure information from EHR structured data. As another example, for a given patient, the phenotype from EHR unstructured and structured data may be linked to the outcome of that patient from a claims data or registry data set.
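Data linkage of this kind can be sketched as a join on a patient identifier. The example below is an illustration only and assumes that both sources already share a common `patient_id` key; how records from different sources are matched in practice is not specified here, and all names and sample rows are hypothetical.

```python
def link_outcomes(phenotypes, outcome_rows):
    """Attach outcome rows from a second data source to phenotyped patients."""
    linked = {}
    for row in outcome_rows:
        patient_id = row["patient_id"]
        if patient_id not in phenotypes:
            continue  # outcome belongs to a patient outside the phenotyped set
        entry = linked.setdefault(
            patient_id, {"phenotype": phenotypes[patient_id], "outcomes": []}
        )
        entry["outcomes"].append(row["outcome"])
    return linked


# Phenotypes derived from EHR data (hypothetical).
phenotypes = {
    "p1": {"systolic heart failure"},
    "p2": {"diabetes mellitus"},
}

# Outcome rows from a claims or registry data set (hypothetical).
claims_outcomes = [
    {"patient_id": "p1", "outcome": "hospitalization"},
    {"patient_id": "p1", "outcome": "readmission"},
    {"patient_id": "p3", "outcome": "hospitalization"},
]

linked = link_outcomes(phenotypes, claims_outcomes)
print(linked)
```

Patients with a phenotype but no linked outcome rows (here `p2`) are left out of the result, so a study would need to decide whether absence of a row means absence of the outcome or absence of data.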

Advanced RWE is More Scalable than Traditional RWE

At least in this context, the present disclosure provides an advanced approach to RWE that overcomes the problems associated with the traditional RWE. This advanced RWE is able to extract a deep phenotype from rich data sources, using advanced technologies including artificial intelligence. The rich data sources include both unstructured data and structured data from the EHR and may include additional data sources such as claims or registries. The advanced technologies include natural language processing, pattern recognition, inference, and other artificial intelligence approaches. The extraction does not rely on structured data queries which are inflexible and limited in scope. Using rich data sources and advanced technical approaches allows the advanced RWE to retrieve relevant information useful for creating an enriched data set which can be used to achieve a deep phenotype.

Moreover, the present disclosure provides methods for checking the accuracy of the deep phenotype, including its recall, which traditional RWE is not capable of measuring. Also, since measurement is possible, the advanced RWE allows an RWE protocol to define a required level of accuracy within a particular study.

Therefore, the deep phenotype can be validated and then used to create a patient cohort that may be linked to exposure or outcome data to make credible clinical assertions. Such advanced RWE based on a validated deep phenotype, accordingly, can be recognized as “research-grade” or “regulatory-grade.”

Deep Phenotype

In order to make credible clinical assertions, the RWE needs to be based on accurate extraction of relevant information from the EHR. Structured EHR data simply does not contain all that information, and is far from it. Unstructured data, however, presents an insurmountable challenge to conventional RWE companies. Language is highly variable in healthcare. A doctor may write “the patient has MA.” A study may have an inclusion criterion of migraine with aura. But, “MA” in this context may represent mass or migraine with aura. Conventional RWE companies do not have the technology to handle variable language in millions of records, each of which may have thousands of pages of content.

Association of Concepts

An example of information relevant to the RWE is clinical concepts, or simply concepts. Association of concepts may be useful for extraction of concepts, i.e., identifying concepts from narrative text, and inference of the concepts. Inference of concepts is using additional information to accurately extract a concept.

Associations of concepts may be maintained as a table of associations, which includes pairs or groups of associated concepts. The benefit of building a table of associations is at least three-fold. First, by recognizing related concepts within a narrative, the likelihood that the concept is relevant is increased. Second, when attempting to disambiguate between meanings, relationships may be helpful. Third, when the system is trying to determine which symptom can be explained by a disease, understanding concept relationships can be used to assess each symptom against each disease.

The table of associations does not need to be perfect, as each concept pair is not used in isolation. For example, in a planned application, a single associated concept such as cough will not be sufficient to support pneumonia. Rather, multiple supporting concepts such as a subset of cough, pulmonary infiltrates, gram stain, rales, Zithromax, and fever will be required. Thus, even if a concept pair is incorrectly identified as being related or incorrectly identified as being unrelated, this will not invalidate the system in which most concept pairs are correctly identified as being related or not related.

To build a table of associations, a large corpus of clinical narratives or medical literature may be used. Software reading this content may use co-occurrence, token proximity, or healthcare knowledge databases such as Systematized Nomenclature of Medicine to learn relationships. In this way, for example, it may be learned that migraine, headache, and light sensitivity are related. This learned set of associations may be used in natural language processing or inference algorithms to disambiguate text, to identify a concept as meaningful, or to enable other processing.
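One of the options named above, co-occurrence, can be sketched as a pair count over documents. The following is a simplified illustration and not the disclosed implementation: it assumes each document has already been reduced to a list of concept names, and the `min_count` threshold and the four sample documents are hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_association_table(documents, min_count=2):
    """Relate two concepts when they co-occur in at least `min_count` documents."""
    pair_counts = Counter()
    for concepts in documents:
        # Count each unordered pair once per document.
        for a, b in combinations(sorted(set(concepts)), 2):
            pair_counts[(a, b)] += 1

    table = {}
    for (a, b), count in pair_counts.items():
        if count >= min_count:
            table.setdefault(a, set()).add(b)
            table.setdefault(b, set()).add(a)
    return table


documents = [
    ["migraine", "headache", "light sensitivity"],
    ["migraine", "headache"],
    ["migraine", "light sensitivity"],
    ["mass", "biopsy"],
]

table = build_association_table(documents)
print(table)
```

With this small corpus, migraine is related to both headache and light sensitivity, while pairs seen only once are dropped. A real corpus would be far larger, and the threshold would trade a cleaner table against missed associations.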

Extraction of Clinical Concepts

With the assistance of the table of associations, an advanced RWE approach may start with extraction of concepts. See, e.g., step 202 in FIG. 2 which illustrates an example process of an advanced RWE. Clinical concepts can be extracted from the EHR dataset using artificial intelligence technologies such as natural language processing (NLP), pattern recognition, and inference. Clinical concepts may be problems, medications, procedures, and lab features, without limitation.

Clinical concept extraction is a specialized text extraction which is a process of extracting meaningful concepts from natural language narrative text. Simple text matching can be done with text matching software. A more robust approach, natural language processing, may recognize subject or negations as in “a brother with cancer” or “no hypertension.” A still more robust approach combines natural language processing with inference as in “Patient with high glucose, uncontrolled DM,” where DM can be recognized as diabetes mellitus based on inference from nearby mention of high glucose. The most robust approach combines natural language processing, inference, and pattern recognition as in “Patient with MA. He describes worsening headache and light sensitivity,” where the pattern of headache, light sensitivity, and MA is far more likely to be migraine with aura than mass.
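The difference between plain text matching and subject or negation recognition can be shown with a toy rule-based tagger. This is a deliberately small sketch and not a production natural language processing pipeline; the cue lists and the `tag_mention` function are hypothetical and cover only the two examples quoted above.

```python
import re

# Hypothetical, deliberately tiny cue lists for illustration.
NEGATION_PATTERN = re.compile(r"\b(?:no|denies|without)\b")
FAMILY_PATTERN = re.compile(r"\b(?:brother|sister|mother|father)\b")


def tag_mention(sentence, concept):
    """Return negation and subject tags for a concept mention, or None if absent."""
    text = sentence.lower()
    position = text.find(concept.lower())
    if position == -1:
        return None
    # Only look at the words that precede the mention in the same sentence.
    before = text[:position]
    return {
        "concept": concept,
        "negated": bool(NEGATION_PATTERN.search(before)),
        "subject": "family" if FAMILY_PATTERN.search(before) else "patient",
    }


brother = tag_mention("a brother with cancer", "cancer")
negated = tag_mention("no hypertension", "hypertension")
print(brother)
print(negated)
```

Plain text matching would report both cancer and hypertension as patient problems. The tagger attributes the cancer to a family member and marks the hypertension as negated, so neither would enter the patient's phenotype.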

Extracted concepts may undergo natural language processing, in some scenarios. Non-limiting examples of cleanup and tagging during natural language processing include removal of special characters, tokenization, sentence splitting, part-of-speech tagging (e.g., tagging tokens with part-of-speech tags such as adjectives, proper nouns), named entity recognition (which matches tokens against an internal map of entities), and negation and subject tagging.

Extracted concepts may undergo inference and pattern recognition, in some scenarios. Context may be used. Context may be as simple as section detection or as complex as reviewing all concepts within a patient narrative.

Section detection helps identify a narrative section to which clinical text can be attributed. This adds context in clinical concept interpretation. For example, a clinical concept appearing in a medical history section may indicate a past condition instead of an ongoing one. Section information is useful in disambiguation of abbreviations and acronyms. For example, the abbreviation CP in a past medical history section may favor cerebral palsy over chest pain depending on other features.
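Section detection can be sketched as header matching over the lines of a note. The example below is a minimal illustration that assumes headers appear at the start of a line followed by a colon; the header list, the section labels, and the sample note are hypothetical, and real notes vary far more in layout.

```python
import re

# Hypothetical mapping from header text to a normalized section label.
SECTION_HEADERS = {
    "past medical history": "medical_history",
    "family history": "family_history",
    "history of the present illness": "present_illness",
    "assessment and plan": "assessment_plan",
}

HEADER_PATTERN = re.compile(r"^\s*([A-Za-z ]+):\s*(.*)$")


def split_sections(note):
    """Group the lines of a note under the most recent recognized section header."""
    sections = {}
    current = "unknown"
    for line in note.splitlines():
        match = HEADER_PATTERN.match(line)
        header = match.group(1).strip().lower() if match else None
        if header in SECTION_HEADERS:
            current = SECTION_HEADERS[header]
            sections.setdefault(current, [])
            if match.group(2):
                sections[current].append(match.group(2))
        elif line.strip():
            sections.setdefault(current, []).append(line.strip())
    return sections


note = """Past Medical History: CP since childhood.
Assessment and Plan:
Continue physical therapy."""

sections = split_sections(note)
print(sections)
```

Once the mention of CP is attributed to the medical history section, a downstream disambiguation step can weigh that section label together with other features.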

Each clinical note may include two, three or more sections. Without limitation, such sections may be medical history, such as surgical history (e.g., operation dates, operation reports, operation narratives), obstetric history (e.g., pregnancies, any complications, pregnancy outcomes), medications and medical allergies, family history (e.g., immediate family member health status, cause of death, common family diseases), social history (e.g., community support, close relationships, past and current occupation), habits (e.g., smoking, alcohol consumption, exercise, diet, sexual history), immunization records (e.g., vaccinations, immunoglobulin test), developmental history (e.g., growth chart, motor development, cognitive/intellectual development, social-emotional development, language development), demographics (e.g., race, age, religion, occupation, contact information), medical encounters (e.g., hospital admissions, specialist consultations, routine checkups), chief complaint, history of the present illness, physical examination (e.g., vital signs, muscle power, organ system examinations), assessment and plan (e.g., diagnosis, treatment), orders and prescriptions, progress notes, and test results (e.g., imaging results, pathology results, specialized testing).

Context may utilize concepts outside of the sentence boundary and the section header. For example, “Patient with MA. He describes worsening headache and light sensitivity,” would require use of information not available to natural language processing since natural language processing ends at the sentence boundary. In this case, a table of associations may be used to understand that headache and light sensitivity are associated with migraine with aura but are not associated with mass. Thus, MA may be disambiguated to mean migraine with aura rather than mass.
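The MA example can be sketched as a vote over the table of associations. The following is an illustrative simplification: the association entries, the abbreviation map, and the overlap-count scoring rule are hypothetical stand-ins for whatever learned associations and scoring are actually used.

```python
# Hypothetical excerpt of a table of associations.
ASSOCIATIONS = {
    "migraine with aura": {"headache", "light sensitivity"},
    "mass": {"biopsy", "imaging"},
}

# Hypothetical abbreviation map with candidate expansions.
ABBREVIATIONS = {"MA": ["migraine with aura", "mass"]}


def disambiguate(abbreviation, context_concepts):
    """Pick the expansion with the most associated concepts found in the context."""
    candidates = ABBREVIATIONS[abbreviation]
    scores = {
        candidate: len(ASSOCIATIONS.get(candidate, set()) & context_concepts)
        for candidate in candidates
    }
    # On a tie, max() keeps the first candidate listed for the abbreviation.
    best = max(candidates, key=lambda candidate: scores[candidate])
    return best, scores


# Concepts drawn from the sentence that follows "Patient with MA."
context = {"headache", "light sensitivity"}
best, scores = disambiguate("MA", context)
print(best, scores)
```

Because the context concepts come from the neighboring sentence, this step uses information that sentence-bounded processing would not see.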

In some embodiments, concept attributes may also be extracted along with the concepts. Non-limiting examples of concept attributes include date of occurrence, result, subject, negation, and importance.

Inference of Clinical Concepts

In some instances, the extracted clinical concepts can undergo further filtration and/or enrichment to infer meaning (e.g., step 206 in FIG. 2). Enrichment, e.g., through pattern recognition, can enrich the set of concepts.

In one embodiment, inference may be used to infer which concepts are highly relevant to a given patient's care as would be required to produce a problem list. A problem list is a list of problems for a patient that are current, meaningful, and unique. In one embodiment, the extracted clinical concepts can be assessed against a knowledge database to (1) assess how important an extracted concept is in a patient's care and (2) remove symptoms that can be explained by diseases already identified for the patient. For example, if “the patient has lung cancer and also mentioned unrelated mild elbow pain during the clinic visit,” the cancer may be meaningful and the elbow pain may be less important. As another example, if “the patient has a cough and was diagnosed with pneumonia,” the pneumonia would be important but the cough is a known symptom of pneumonia and is not as important. Using associations of concepts, meaning and explanation can be inferred. For example, in the patient with cancer, if the entire narrative discusses cancer imaging and cancer therapy but does not mention any associations of elbow pain, it can be inferred that cancer is more important. Thus, using associations can allow software to identify which concepts have more support from related concepts than others. A disease is a single pathophysiologic state that produces signs or symptoms, while a symptom refers to a physical or mental feature that is indicative of a disease.

In some scenarios, each extracted candidate clinical concept may be assessed for being meaningful by testing its relationships with other problems, findings, signs, symptoms, procedures, medications, or laboratory studies. For instance, 20-30 problems may exist in a typical 5-7 page history and physical narrative in a typical electronic health record. When a longitudinal record is parsed, however, natural language processing may extract hundreds of potential problems, resulting in a clinically meaningless problem list. Similar to problems, a typical patient's electronic health record may include 100+ findings, 2-20 medications, 2-10 procedures, and 5-30 lab values. Thus, an inference module may assess the likelihood that a clinical concept is meaningful by identifying related references within the record.

For example, if two candidate problems are chest pain and pneumonia, first the system will test all discovered features against chest pain. Features within the narrative may include concepts such as EKG (procedure), ST elevation (finding), troponin (lab), coronary artery disease (disease), and aspirin (medication). When compared against the table of associations, there would be five concepts associated with chest pain, resulting in this concept being supported as a strong candidate problem. The system will also assess pneumonia for support. The text may have been “r/o pneumonia” and the natural language processing is not confident whether pneumonia is a real problem. In this task, the concepts of EKG, troponin, coronary artery disease, and aspirin are unassociated with pneumonia as defined by the table of associations. Thus, pneumonia would be considered a less meaningful or low likelihood problem.
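The chest pain and pneumonia example can be sketched as a simple support count against a table of associations. The table contents below are illustrative assumptions (in particular, the concepts listed for pneumonia are invented for the sketch), and `support_score` is a hypothetical helper rather than a function of the described system.

```python
# Hypothetical table of associations: candidate problem -> related concepts.
ASSOCIATED_WITH = {
    "chest pain": {
        "EKG", "ST elevation", "troponin",
        "coronary artery disease", "aspirin",
    },
    # Assumed entries for illustration only.
    "pneumonia": {"cough", "fever", "chest x-ray", "antibiotic"},
}


def support_score(candidate, features):
    """Count the features in the record that are associated with the candidate."""
    return len(ASSOCIATED_WITH.get(candidate, set()) & set(features))


# Features discovered in the narrative, as in the example above.
features = ["EKG", "ST elevation", "troponin", "coronary artery disease", "aspirin"]

scores = {c: support_score(c, features) for c in ("chest pain", "pneumonia")}
# Rank candidates from most to least supported.
ranked = sorted(scores, key=scores.get, reverse=True)
print(scores, ranked)
```

A raw count is the simplest possible measure of support; weighting associations by strength or by concept type would be a natural refinement, but the text does not specify one.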

To remove symptoms explained by a disease, each symptom in the candidate problem list may be checked against all diseases within the table of associations to assess potential association. A symptom that is related to any disease in the candidate problem list will be discarded.
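A minimal sketch of this symptom-removal step follows, assuming each candidate carries a concept type and that the table of associations can be queried for the symptoms of a disease. The `SYMPTOMS_OF` table and the `drop_explained_symptoms` function are hypothetical names introduced for illustration.

```python
# Hypothetical slice of the table of associations: disease -> known symptoms.
SYMPTOMS_OF = {
    "pneumonia": {"cough", "fever"},
    "lung cancer": {"cough", "weight loss"},
}


def drop_explained_symptoms(candidates):
    """Remove symptoms that are associated with any disease in the same list.

    `candidates` is a list of (name, kind) pairs, where kind is
    "disease" or "symptom". Order of the remaining entries is preserved.
    """
    explained = set()
    for name, kind in candidates:
        if kind == "disease":
            explained |= SYMPTOMS_OF.get(name, set())
    return [
        (name, kind)
        for name, kind in candidates
        if not (kind == "symptom" and name in explained)
    ]


candidates = [
    ("pneumonia", "disease"),
    ("cough", "symptom"),
    ("elbow pain", "symptom"),
]
problem_list = drop_explained_symptoms(candidates)
print(problem_list)
```

Here the cough is discarded because it is explained by pneumonia, while the elbow pain survives this step; whether it is kept in the final problem list would depend on its level of support.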

Concept Coding

The clinical concepts extracted from the EHR dataset, in some scenarios, can then be matched against a listing of coded (or predetermined) clinical concepts (e.g., step 204 of FIG. 2). Such matching is preferably done before concept filtration/enrichment (as shown in FIG. 2), but can also be done afterwards. For instance, for a cardiovascular medicine study, the listing of coded clinical concepts may include hyperlipidemia, hypercholesterolemia, coronary artery disease, diabetes mellitus, myocardial infarction, chronic kidney disease, stroke, dementia, cataract, coronary artery bypass graft, atorvastatin, pravastatin, rosuvastatin, simvastatin, LDL cholesterol, HDL cholesterol, and total cholesterol. The list of concepts may be maintained as an industry standard terminology such as Systematized Nomenclature of Medicine (SNOMED), Logical Observation Identifiers Names and Codes (LOINC), International Classification of Diseases (ICD), or RxNorm. In some scenarios, prior to the matching, the extracted clinical concepts may be normalized along a canonical information model (e.g., step 203 of FIG. 2).

Mapping an extracted clinical concept to a coded clinical concept can be done with deterministic or probabilistic techniques, or a combination thereof. For inexact matches, a probabilistic model can be used to find the most likely matches. Fuzzy matching, in some instances, is performed using approximate dictionary matching.
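One way to combine deterministic and approximate matching is sketched below. It uses Python's standard `difflib` module as a stand-in for approximate dictionary matching; the text does not say which algorithm is used, so this choice is an assumption. The three-entry `CODED_CONCEPTS` dictionary reuses the SNOMED codes and labels quoted elsewhere in this description, and `map_concept` and its `cutoff` parameter are hypothetical.

```python
import difflib

# Small illustrative dictionary: label -> SNOMED code (codes as quoted in this
# description; a real terminology would have far more entries and synonyms).
CODED_CONCEPTS = {
    "diabetes": "44054006",
    "heart attack": "22298006",
    "obesity": "414916001",
}


def map_concept(text, cutoff=0.8):
    """Map extracted text to (label, code, score), or None if nothing is close.

    An exact match on the normalized text is tried first (deterministic);
    otherwise the closest dictionary label above `cutoff` is used (fuzzy).
    """
    key = " ".join(text.lower().split())
    if key in CODED_CONCEPTS:
        return key, CODED_CONCEPTS[key], 1.0
    matches = difflib.get_close_matches(key, list(CODED_CONCEPTS), n=1, cutoff=cutoff)
    if not matches:
        return None
    label = matches[0]
    score = difflib.SequenceMatcher(None, key, label).ratio()
    return label, CODED_CONCEPTS[label], score


print(map_concept("Diabetes"))    # exact match after normalization
print(map_concept("diabetis"))    # misspelling resolved by fuzzy matching
print(map_concept("elbow pain"))  # no sufficiently close label
```

The similarity score returned for fuzzy matches could feed a probabilistic model or a review queue; the cutoff of 0.8 is an arbitrary value chosen for the sketch.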

Validating the Phenotype

Distilling a patient's longitudinal record into a set of coded concepts creates a computable phenotype. The phenotype may include demographics, diseases, symptoms, signs, findings, procedures, medications, laboratory studies, and other characteristics. For each coded concept, there may be additional attributes such as date of occurrence, result, subject, negation, and importance.

To run a study on a thousand patients that meet inclusion and exclusion criteria, it may be necessary to test the phenotype of millions of patients. Thus, accurate identification of the phenotype is necessary. An example validation process is illustrated in FIG. 3.

A first step in determining accuracy of the phenotype may be creation of a gold standard for a subset of the patients. For example, if a million patients are analyzed for a cardiovascular study that has 20 inclusion and exclusion criteria, then a subset of this million may be assessed (e.g., step 302 in FIG. 3). The subset may be selected randomly. For example, 5,000 randomly selected longitudinal records may be sampled to determine the accuracy of concept extraction for the million patients.
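Random selection of the subset can be sketched with the standard library. The function name `sample_record_ids` and the fixed seed are assumptions of the sketch; the seed is used only so that the same sample can be reproduced for audit.

```python
import random


def sample_record_ids(record_ids, sample_size, seed=0):
    """Draw a simple random sample without replacement from a sequence of IDs."""
    rng = random.Random(seed)  # dedicated generator so the draw is reproducible
    return rng.sample(record_ids, sample_size)


# One million patient records, identified here by hypothetical integer IDs.
population = range(1_000_000)
gold_sample = sample_record_ids(population, 5_000)
print(len(gold_sample), len(set(gold_sample)))
```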

In one embodiment, the gold standard may include a person reviewing each longitudinal record for all or some of the 20 inclusion and exclusion criteria and for additional relevant concepts (e.g., step 324 in FIG. 3). In a preferred embodiment, the gold standard may include two people reviewing each randomly selected record. These annotators may be blinded to each other's annotations and inter-rater reliability may be measured (e.g., step 326 in FIG. 3). In a preferred embodiment, a Cohen's kappa score may be measured and a minimum kappa score may be required to deem the gold standard sufficiently accurate for use. A gold standard may be considered generated if it meets the minimum requirement (e.g., step 328 in FIG. 3).
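Cohen's kappa compares the observed agreement between two annotators with the agreement expected by chance. A minimal sketch of the calculation follows, using the standard definition kappa = (p_o - p_e) / (1 - p_e); the annotation data are invented for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Invented annotations: does each of ten records meet one inclusion criterion?
annotator_1 = [1, 1, 1, 0, 0, 0, 1, 0, 1, 1]
annotator_2 = [1, 1, 0, 0, 0, 0, 1, 0, 1, 1]
kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

In this example the annotators agree on 9 of 10 records (p_o = 0.9) and chance agreement is 0.5, giving a kappa of 0.8, which would satisfy a minimum requirement of 0.7.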

For the sampled patient records, automated extraction (e.g., steps 304-308 in FIG. 3, which correspond to steps 202-206 in FIG. 2, respectively) may be compared against the gold standard to determine accuracy of extraction for the sampled patient records (e.g., step 340 in FIG. 3). Accuracy may be measured as recall and precision, or as sensitivity and specificity. In a preferred embodiment, a study protocol may require minimum accuracy for a study.
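For a single criterion, these measures can be computed from the counts of true and false positives and negatives. The sketch below assumes the gold standard and the automated extraction are each available as one boolean per sampled record; the function name and the data are hypothetical.

```python
def extraction_metrics(gold, predicted):
    """Precision, recall (sensitivity), and specificity for one criterion."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)          # correctly extracted
    fp = sum(1 for g, p in pairs if not g and p)      # extracted but not in gold
    fn = sum(1 for g, p in pairs if g and not p)      # missed by extraction
    tn = sum(1 for g, p in pairs if not g and not p)  # correctly absent

    def ratio(numerator, denominator):
        # None signals an undefined measure (no cases in the denominator).
        return numerator / denominator if denominator else None

    return {
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),       # identical to sensitivity
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
    }


# Invented data for ten sampled records.
gold = [True, True, True, True, False, False, False, False, False, False]
predicted = [True, True, True, False, True, False, False, False, False, False]
metrics = extraction_metrics(gold, predicted)
print(metrics)
```

Recall and sensitivity are the same quantity under two names; precision and specificity differ, which is why the two pairs of measures are reported separately.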

For example, a study that is expecting an effect size of 20% when comparing two treatments may require 80% accuracy of data. In this situation, the study protocol may require a gold standard of at least 1,000 patients with a minimum inter-rater reliability of 0.7. Additionally, the study may require a minimum precision and recall of each inclusion and exclusion criterion of 80%.
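The example protocol can be expressed as a set of thresholds that a validation run either meets or fails. The sketch below encodes the numbers from the example above; the class, the function, and the per-criterion results are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StudyProtocol:
    """Minimum requirements taken from the example in the text."""
    min_gold_size: int = 1000
    min_kappa: float = 0.7
    min_precision: float = 0.8
    min_recall: float = 0.8


def protocol_failures(protocol, gold_size, kappa, per_criterion):
    """Return a list of unmet requirements; an empty list means the run passes.

    `per_criterion` maps a criterion name to a (precision, recall) pair.
    """
    failures = []
    if gold_size < protocol.min_gold_size:
        failures.append("gold standard too small")
    if kappa < protocol.min_kappa:
        failures.append("inter-rater reliability too low")
    for name, (precision, recall) in sorted(per_criterion.items()):
        if precision < protocol.min_precision:
            failures.append(f"precision too low for {name}")
        if recall < protocol.min_recall:
            failures.append(f"recall too low for {name}")
    return failures


# Invented validation results for two criteria.
results = {"diabetes": (0.92, 0.85), "heart attack": (0.88, 0.76)}
failures = protocol_failures(StudyProtocol(), 1200, 0.74, results)
print(failures)
```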

Testing the Coded Concepts Against a Study Phenotype

Once the accuracy of extraction has been established and a large corpus of patient records has been turned into an accurate set of coded concepts, these coded concepts may be compared against the study inclusion and exclusion criteria (e.g., step 208 of FIG. 2). Specifically, each patient will have a coded phenotype and the study will have a required coded phenotype. For example, the study phenotype may be patients with age>60 who have diabetes (SNOMED 44054006) but have not had a heart attack (SNOMED 22298006). A patient may have a phenotype of age 64, diabetes (SNOMED 44054006), obesity (SNOMED 414916001), and heart attack (SNOMED 22298006). This patient would not be included in the study because the study excludes heart attack and the patient has had a heart attack. In this way, a large number of patients may be accurately matched against a very detailed study phenotype.
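The matching in this example reduces to set operations on coded concepts. The sketch below uses the SNOMED codes quoted above; the class and function names are hypothetical, and the sketch assumes each patient's code set already contains only asserted (non-negated) concepts about the patient.

```python
from dataclasses import dataclass

# SNOMED codes as quoted in the example above.
DIABETES = "44054006"
HEART_ATTACK = "22298006"
OBESITY = "414916001"


@dataclass(frozen=True)
class StudyPhenotype:
    min_age_exclusive: int      # patient age must be strictly greater
    required_codes: frozenset   # inclusion criteria
    excluded_codes: frozenset   # exclusion criteria


@dataclass(frozen=True)
class PatientPhenotype:
    patient_id: str
    age: int
    codes: frozenset


def is_eligible(patient, study):
    """True if the patient meets every inclusion and no exclusion criterion."""
    return (
        patient.age > study.min_age_exclusive
        and study.required_codes <= patient.codes
        and not (study.excluded_codes & patient.codes)
    )


study = StudyPhenotype(60, frozenset({DIABETES}), frozenset({HEART_ATTACK}))
patient = PatientPhenotype("p1", 64, frozenset({DIABETES, OBESITY, HEART_ATTACK}))
other = PatientPhenotype("p2", 71, frozenset({DIABETES, OBESITY}))
print(is_eligible(patient, study), is_eligible(other, study))
```

The first patient matches the age and diabetes criteria but is excluded by the heart attack code, as in the example; the second, invented patient is eligible.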

Linking the Selected Patients to Exposures and Outcomes

Once a set of patients has been identified from the electronic health record for inclusion in the study, it is necessary to understand the exposures and outcomes for each patient. An exposure is an intervention which may influence the course of care. This may be a medication, procedure, or other intervention. An outcome is a result of care. This may be a clinical outcome such as worsening pain, heart attack, or death or may be a financial outcome such as cost of care.

Exposures or outcomes may be available in the EHR. In some circumstances, medication treatments and procedures are tracked in an EHR. In some circumstances, a study may track symptoms or financial claims that are stored in an EHR. In these circumstances, linkage requires only linking different data sets from the EHR.

Often, exposures or outcomes are either not available in the EHR or not complete in the EHR. For example, a patient may be seen in a clinic, but may have some prescriptions filled at an unaffiliated pharmacy. In this situation, some or all medications the patient has received may not be tracked in the EHR and linkage to a pharmacy data set may be required. As another example, if the tracked outcome is heart attack, this may not be tracked in the EHR because the patient may have had a heart attack and be taken to the nearest hospital which has its own EHR system. This may require linkage to claims data since heart attack treatment is billable and a database of claims would track that the patient submitted a claim for heart attack treatment. As another example, if the tracked outcome is death, this may not be tracked in the EHR because the patient may have died at home or a different hospital than where they routinely receive care. This may require linkage to a national death registry.
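Linkage across these sources can be sketched as a join on a patient identifier. The sketch assumes, for simplicity, that every source already shares a common identifier; in practice, establishing that shared identifier is often the hard part of linkage and is not shown here. The source names, rows, and `link_records` function are hypothetical.

```python
def link_records(cohort_ids, sources):
    """Collect exposures and outcomes for cohort patients across data sources.

    `sources` maps a source name to rows of (patient_id, kind, description),
    where kind must be "exposure" or "outcome". Rows for patients outside
    the cohort are ignored.
    """
    linked = {pid: {"exposure": [], "outcome": []} for pid in cohort_ids}
    for source_name, rows in sources.items():
        for patient_id, kind, description in rows:
            if patient_id in linked:
                linked[patient_id][kind].append((source_name, description))
    return linked


# Invented rows illustrating the three linkage examples above.
sources = {
    "ehr": [("p1", "exposure", "atorvastatin prescribed")],
    "pharmacy": [
        ("p1", "exposure", "atorvastatin dispensed"),
        ("p9", "exposure", "simvastatin dispensed"),  # not in the cohort
    ],
    "claims": [("p2", "outcome", "heart attack treatment")],
    "death_registry": [("p2", "outcome", "death")],
}
linked = link_records(["p1", "p2"], sources)
print(linked["p1"]["exposure"])
print(linked["p2"]["outcome"])
```

Keeping the source name alongside each event preserves provenance, which makes it possible to see which exposures or outcomes would have been missed by the EHR alone.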

IV. Computing Systems for Implementing the Advanced RWE

FIG. 4 is a block diagram that illustrates a computer system 400 upon which any embodiments of the advanced RWE and related technologies may be implemented. The computer system 400 includes a bus 402 or other communication mechanism for communicating information, and one or more hardware processors 404 coupled with bus 402 for processing information. Hardware processor(s) 404 may be, for example, one or more general purpose microprocessors.

The computer system 400 also includes a main memory 406, such as a random access memory (RAM), cache and/or other dynamic storage devices, coupled to bus 402 for storing information and instructions to be executed by processor 404. Main memory 406 also may be used for storing temporary variables or other intermediate information during execution of instructions to be executed by processor 404. Such instructions, when stored in storage media accessible to processor 404, render computer system 400 into a special-purpose machine that is customized to perform the operations specified in the instructions.

The computer system 400 further includes a read only memory (ROM) 408 or other static storage device coupled to bus 402 for storing static information and instructions for processor 404. A storage device 410, such as a magnetic disk, optical disk, or USB thumb drive (Flash drive), etc., is provided and coupled to bus 402 for storing information and instructions.

The computer system 400 may be coupled via bus 402 to a display 412, such as an LED or LCD display (or touch screen), for displaying information to a computer user. An input device 414, including alphanumeric and other keys, is coupled to bus 402 for communicating information and command selections to processor 404. Another type of user input device is cursor control 416, such as a mouse, a trackball, or cursor direction keys for communicating direction information and command selections to processor 404 and for controlling cursor movement on display 412. In some embodiments, the same direction information and command selections as cursor control may be implemented via receiving touches on a touch screen without a cursor. Additional data may be retrieved from the external data storage 418.

The computer system 400 may include a user interface module to implement a GUI that may be stored in a mass storage device as executable software codes that are executed by the computing device(s). This and other modules may include, by way of example, components, such as software components, object-oriented software components, class components and task components, processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuitry, data, databases, data structures, tables, arrays, and variables.

In general, the word “module,” as used herein, refers to logic embodied in hardware or firmware, or to a collection of software instructions, possibly having entry and exit points, written in a programming language, such as, for example, Java, C or C++. A software module may be compiled and linked into an executable program, installed in a dynamic link library, or may be written in an interpreted programming language such as, for example, BASIC, Perl, or Python. It will be appreciated that software modules may be callable from other modules or from themselves, and/or may be invoked in response to detected events or interrupts. Software modules configured for execution on computing devices may be provided on a computer readable medium, such as a compact disc, digital video disc, flash drive, magnetic disc, or any other tangible medium, or as a digital download (and may be originally stored in a compressed or installable format that requires installation, decompression or decryption prior to execution). Such software code may be stored, partially or fully, on a memory device of the executing computing device, for execution by the computing device. Software instructions may be embedded in firmware, such as an EPROM. It will be further appreciated that hardware modules may be comprised of connected logic units, such as gates and flip-flops, and/or may be comprised of programmable units, such as programmable gate arrays or processors. The modules or computing device functionality described herein are preferably implemented as software modules, but may be represented in hardware or firmware. Generally, the modules described herein refer to logical modules that may be combined with other modules or divided into sub-modules despite their physical organization or storage.

The computer system 400 may implement the techniques described herein using customized hard-wired logic, one or more ASICs or FPGAs, firmware and/or program logic which in combination with the computer system causes or programs computer system 400 to be a special-purpose machine. According to one embodiment, the techniques herein are performed by computer system 400 in response to processor(s) 404 executing one or more sequences of one or more instructions contained in main memory 406. Such instructions may be read into main memory 406 from another storage medium, such as storage device 410. Execution of the sequences of instructions contained in main memory 406 causes processor(s) 404 to perform the process steps described herein. In alternative embodiments, hard-wired circuitry may be used in place of or in combination with software instructions.

The term “non-transitory media,” and similar terms, as used herein refers to any media that store data and/or instructions that cause a machine to operate in a specific fashion. Such non-transitory media may comprise non-volatile media and/or volatile media. Non-volatile media includes, for example, optical or magnetic disks, such as storage device 410. Volatile media includes dynamic memory, such as main memory 406. Common forms of non-transitory media include, for example, a floppy disk, a flexible disk, hard disk, solid state drive, magnetic tape, or any other magnetic data storage medium, a CD-ROM, any other optical data storage medium, any physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, NVRAM, any other memory chip or cartridge, and networked versions of the same.

Non-transitory media is distinct from but may be used in conjunction with transmission media. Transmission media participates in transferring information between non-transitory media. For example, transmission media includes coaxial cables, copper wire and fiber optics, including the wires that comprise bus 402. Transmission media can also take the form of acoustic or light waves, such as those generated during radio-wave and infra-red data communications.

Various forms of media may be involved in carrying one or more sequences of one or more instructions to processor 404 for execution. For example, the instructions may initially be carried on a magnetic disk or solid-state drive of a remote computer. The remote computer can load the instructions into its dynamic memory and send the instructions over a telephone line using a component control. A component control local to computer system 400 can receive the data on the telephone line and use an infra-red transmitter to convert the data to an infra-red signal. An infra-red detector can receive the data carried in the infra-red signal and appropriate circuitry can place the data on bus 402. Bus 402 carries the data to main memory 406, from which processor 404 retrieves and executes the instructions. The instructions received by main memory 406 may optionally be stored on storage device 410 either before or after execution by processor 404.

The computer system 400 also includes a communication interface 418 coupled to bus 402. Communication interface 418 provides a two-way data communication coupling to one or more network links that are connected to one or more local networks. For example, communication interface 418 may be an integrated services digital network (ISDN) card, cable component control, satellite component control, or a component control to provide a data communication connection to a corresponding type of telephone line. As another example, communication interface 418 may be a local area network (LAN) card to provide a data communication connection to a compatible LAN (or WAN component to communicate with a WAN). Wireless links may also be implemented. In any such implementation, communication interface 418 sends and receives electrical, electromagnetic or optical signals that carry digital data streams representing various types of information.

A network link typically provides data communication through one or more networks to other data devices. For example, a network link may provide a connection through local network to a host computer or to data equipment operated by an Internet Service Provider (ISP). The ISP in turn provides data communication services through the world-wide packet data communication network now commonly referred to as the “Internet”. Local network and Internet both use electrical, electromagnetic or optical signals that carry digital data streams. The signals through the various networks and the signals on network link and through communication interface 418, which carry the digital data to and from computer system 400, are example forms of transmission media.

The computer system 400 can send messages and receive data, including program code, through the network(s), network link and communication interface 418. In the Internet example, a server might transmit a requested code for an application program through the Internet, the ISP, the local network and the communication interface 418.

The received code may be executed by processor 404 as it is received, and/or stored in storage device 410, or other non-volatile storage for later execution. Each of the processes, methods, and algorithms described in the preceding sections may be embodied in, and fully or partially automated by, code modules executed by one or more computer systems or computer processors comprising computer hardware. The processes and algorithms may be implemented partially or wholly in application-specific circuitry.

The various features and processes described above may be used independently of one another, or may be combined in various ways. All possible combinations and sub-combinations are intended to fall within the scope of this disclosure. In addition, certain method or process blocks may be omitted in some implementations. The methods and processes described herein are also not limited to any particular sequence, and the blocks or states relating thereto can be performed in other sequences that are appropriate. For example, described blocks or states may be performed in an order other than that specifically disclosed, or multiple blocks or states may be combined in a single block or state. The example blocks or states may be performed in serial, in parallel, or in some other manner. Blocks or states may be added to or removed from the disclosed example embodiments. The example systems and components described herein may be configured differently than described. For example, elements may be added to, removed from, or rearranged compared to the disclosed example embodiments.

Any process descriptions, elements, or blocks in the flow diagrams described herein and/or depicted in the attached figures should be understood as potentially representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps in the process. Alternate implementations are included within the scope of the embodiments described herein in which elements or functions may be deleted, executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those skilled in the art.

It should be emphasized that many variations and modifications may be made to the above-described embodiments, the elements of which are to be understood as being among other acceptable examples. All such modifications and variations are intended to be included herein within the scope of this disclosure. The foregoing description details certain embodiments of the invention. It will be appreciated, however, that no matter how detailed the foregoing appears in text, the invention can be practiced in many ways. As is also stated above, it should be noted that the use of particular terminology when describing certain features or aspects of the invention should not be taken to imply that the terminology is being re-defined herein to be restricted to including any specific characteristics of the features or aspects of the invention with which that terminology is associated. The scope of the embodiments should, therefore, be construed in accordance with the appended claims and any equivalents thereof.

The various operations of example methods described herein may be performed, at least partially, by one or more processors that are temporarily configured (e.g., by software) or permanently configured to perform the relevant operations. Similarly, the methods described herein may be at least partially processor-implemented, with a particular processor or processors being an example of hardware. For example, at least some of the operations of a method may be performed by one or more processors. Moreover, the one or more processors may also operate to support performance of the relevant operations in a “cloud computing” environment or as a “software as a service” (SaaS). For example, at least some of the operations may be performed by a group of computers (as examples of machines including processors), with these operations being accessible via a network (e.g., the Internet) and via one or more appropriate interfaces (e.g., an Application Program Interface (API)).

The performance of certain of the operations may be distributed among the processors, not only residing within a single machine but deployed across a number of machines. In some example embodiments, the processors may be located in a single geographic location (e.g., within a home environment, an office environment, or a server farm). In other example embodiments, the processors may be distributed across a number of geographic locations.

Throughout this specification, plural instances may implement components, operations, or structures described as a single instance. Although individual operations of one or more methods are illustrated and described as separate operations, one or more of the individual operations may be performed concurrently, and nothing requires that the operations be performed in the order illustrated. Structures and functionality presented as separate components in example configurations may be implemented as a combined structure or component. Similarly, structures and functionality presented as a single component may be implemented as separate components. These and other variations, modifications, additions, and improvements fall within the scope of the subject matter herein.

Although an overview of the subject matter has been described with reference to specific example embodiments, various modifications and changes may be made to these embodiments without departing from the broader scope of embodiments of the present disclosure. Such embodiments of the subject matter may be referred to herein, individually or collectively, by the term “invention” merely for convenience and without intending to voluntarily limit the scope of this application to any single disclosure or concept if more than one is, in fact, disclosed.

The embodiments illustrated herein are described in sufficient detail to enable those skilled in the art to practice the teachings disclosed. Other embodiments may be used and derived therefrom, such that structural and logical substitutions and changes may be made without departing from the scope of this disclosure. The Detailed Description, therefore, is not to be taken in a limiting sense, and the scope of various embodiments is defined only by the appended claims, along with the full range of equivalents to which such claims are entitled.

As used herein, the term “or” may be construed in either an inclusive or exclusive sense. Moreover, plural instances may be provided for resources, operations, or structures described herein as a single instance. Additionally, boundaries between various resources, operations, and data stores are somewhat arbitrary, and particular operations are illustrated in a context of specific illustrative configurations. Other allocations of functionality are envisioned and may fall within a scope of various embodiments of the present disclosure. In general, structures and functionality presented as separate resources in the example configurations may be implemented as a combined structure or resource. Similarly, structures and functionality presented as a single resource may be implemented as separate resources. These and other variations, modifications, additions, and improvements fall within a scope of embodiments of the present disclosure as represented by the appended claims. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.

Conditional language, such as, among others, “can,” “could,” “might,” or “may,” unless specifically stated otherwise, or otherwise understood within the context as used, is generally intended to convey that certain embodiments include, while other embodiments do not include, certain features, elements and/or steps. Thus, such conditional language is not generally intended to imply that features, elements and/or steps are in any way required for one or more embodiments or that one or more embodiments necessarily include logic for deciding, with or without user input or prompting, whether these features, elements and/or steps are included or are to be performed in any particular embodiment.

Although the invention has been described in detail for the purpose of illustration based on what is currently considered to be the most practical and preferred implementations, it is to be understood that such detail is solely for that purpose and that the invention is not limited to the disclosed implementations, but, on the contrary, is intended to cover modifications and equivalent arrangements that are within the spirit and scope of the appended claims. For example, it is to be understood that the present invention contemplates that, to the extent possible, one or more features of any embodiment can be combined with one or more features of any other embodiment.

1. A computer-implemented method, comprising: extracting, from unstructured data in an electronic health record (EHR), a plurality of clinical concepts using a semantic processing technique; mapping each of the extracted clinical concepts to a coded clinical concept; comparing the mapped clinical concepts and concept attributes to inclusion or exclusion criteria to define a cohort of patients within the EHR that satisfy a desired phenotype; creating a generated gold standard for a portion of the clinical concepts within a portion of the patients within the cohort; and measuring an accuracy of the semantic processing-based extraction of the clinical concepts for the cohort to determine validity of the cohort with respect to the generated gold standard for a subset of the cohort, based on at least a portion of the inclusion or exclusion criteria.
2. The method of claim 1, wherein the extracting the plurality of clinical concepts comprises: (a) obtaining a table of association that maintains associations of clinical concepts; (b) extracting, using an artificial intelligence technology, from a patient record, the plurality of clinical concepts; (c) determining a level of support for each extracted clinical concept at least based on an association between the extracted clinical concept and other clinical concepts extracted from the patient record according to the table of association; (d) identifying, from the plurality of clinical concepts by checking the table of association, a clinical concept representing a symptom already explained by another clinical concept in the plurality of clinical concepts representing a disease; and (e) filtering the extracted clinical concepts by exclusion of (1) extracted clinical concepts having relatively lower levels of support among the extracted clinical concepts, and (2) the clinical concept identified in (d).
3. The method of claim 2, further comprising: constructing the table of association based on a corpus of clinical narratives or medical literature.
4. The method of claim 1, wherein the inclusion or exclusion criteria are generated by at least: associating at least a subset of the clinical concepts and concept attributes with the desired phenotype, wherein the desired phenotype satisfies a threshold phenotypic similarity to a phenotype in a randomized controlled trial.
5. The method of claim 1, wherein the inclusion or exclusion criteria are generated by at least: associating at least a subset of the clinical concepts and concept attributes with the desired phenotype, wherein the desired phenotype satisfies a threshold phenotypic similarity to a phenotype in an existing or anticipated regulatory-approved label.
6. The method of claim 1, further comprising: in response to the accuracy being above a threshold, obtaining, for the cohort, exposure data and outcome data relating to at least a portion of the patients within the cohort of patients; and implementing a real-world evidence (RWE) study based on the patient phenotype associated with the exposure data or the outcome data for at least one of the patients.
7. The method of claim 6, wherein the implementing of the RWE study comprises comparing outcomes from the outcome data of the cohort with outcomes from an interventional study so that the cohort functions as a synthetic control arm.
8. The method of claim 6, wherein the implementing of the RWE study comprises comparing outcomes of the cohort with outcomes from another cohort or another study to determine comparative effectiveness of at least two treatments.
9. The method of claim 6, wherein the implementing of the RWE study comprises conducting, based on the cohort, an observational study.
10. The method of claim 6, wherein the implementing of the RWE study comprises comparing outcomes of cohorts based on demographically distinct subpopulations on similar treatment regimens to understand heterogeneity of treatment effects on those subpopulations.
11. The method of claim 6, wherein the implementing of the RWE study comprises multiple subgroups to determine preferred design of a randomized controlled trial (RCT).
12. The method of claim 6, wherein the implementing of the RWE study comprises implementing the association of the patient phenotype with the exposure data or the outcome data through data linkage with another data set.
13. The method of claim 6, wherein the implementing of the RWE study comprises implementing the association of the patient phenotype with the exposure data and the outcome data for identifying patient safety events for pharmacovigilance.
14. A non-transitory computer-readable storage medium configured with instructions executable by one or more processors to cause the one or more processors to perform operations comprising: extracting, from unstructured data in an electronic health record (EHR), a plurality of clinical concepts using a semantic processing technique; mapping each of the extracted clinical concepts to a coded clinical concept; comparing the mapped clinical concepts and concept attributes to inclusion or exclusion criteria to define a cohort of patients within the EHR that satisfy a desired phenotype; creating a generated gold standard for a portion of the clinical concepts within a portion of the patients within the cohort; and measuring an accuracy of the semantic processing-based extraction of the clinical concepts for the cohort to determine validity of the cohort with respect to the generated gold standard for a subset of the cohort, based on at least a portion of the inclusion or exclusion criteria.
15. The non-transitory computer-readable storage medium of claim 14, wherein the extracting the plurality of clinical concepts comprises: (a) obtaining a table of association that maintains associations of clinical concepts; (b) extracting, using an artificial intelligence technology, from a patient record, the plurality of clinical concepts; (c) determining a level of support for each extracted clinical concept at least based on an association between the extracted clinical concept and other clinical concepts extracted from the patient record according to the table of association; (d) identifying, from the plurality of clinical concepts by checking the table of association, a clinical concept representing a symptom already explained by another clinical concept in the plurality of clinical concepts representing a disease; and (e) filtering the extracted clinical concepts by exclusion of (1) extracted clinical concepts having relatively lower levels of support among the extracted clinical concepts, and (2) the clinical concept identified in (d).
16. The non-transitory computer-readable storage medium of claim 15, wherein the operations further comprise: constructing the table of association based on a corpus of clinical narratives or medical literature.
17. The non-transitory computer-readable storage medium of claim 14, wherein the inclusion or exclusion criteria are generated by at least: associating at least a subset of the clinical concepts and concept attributes with the desired phenotype, wherein the desired phenotype satisfies a threshold phenotypic similarity to a phenotype in a randomized controlled trial.
18. The non-transitory computer-readable storage medium of claim 14, wherein the inclusion or exclusion criteria are generated by at least: associating at least a subset of the clinical concepts and concept attributes with the desired phenotype, wherein the desired phenotype satisfies a threshold phenotypic similarity to a phenotype in an existing or anticipated regulatory-approved label.
19. The non-transitory computer-readable storage medium of claim 14, wherein the operations further comprise: in response to the accuracy being above a threshold, obtaining, for the cohort, exposure data and outcome data relating to at least a portion of the patients within the cohort of patients; and implementing a real-world evidence (RWE) study based on the patient phenotype associated with the exposure data or the outcome data for at least one of the patients.
20. The non-transitory computer-readable storage medium of claim 19, wherein the implementing of the RWE study comprises conducting, based on the cohort, an observational study.
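
The filtering recited in steps (c) through (e) of claim 15 can be pictured with a minimal Python sketch. Everything specific in it is an assumption made for illustration and is not taken from the claims: the contents of the table of association, the `EXPLAINED_BY` symptom map, the function names, the use of a summed association strength as the "level of support," and the fixed `min_support` cutoff standing in for "relatively lower levels of support." It is one possible reading, not the claimed implementation.

```python
# Hypothetical table of association: pair of concepts -> strength in [0, 1].
ASSOCIATION = {
    ("heart failure", "dyspnea"): 0.9,
    ("heart failure", "edema"): 0.8,
    ("heart failure", "furosemide"): 0.7,
}

# Hypothetical symptom -> diseases that explain it (assumed to be derivable
# from the same table of association).
EXPLAINED_BY = {
    "dyspnea": {"heart failure"},
    "edema": {"heart failure"},
}


def association(a, b):
    """Look up the association strength in either order; 0.0 if absent."""
    return ASSOCIATION.get((a, b), ASSOCIATION.get((b, a), 0.0))


def support_levels(concepts):
    """Step (c): support of a concept is the sum of its association
    strengths with the other concepts extracted from the same record."""
    return {
        c: sum(association(c, other) for other in concepts if other != c)
        for c in concepts
    }


def explained_symptoms(concepts):
    """Step (d): symptoms already explained by a disease that is also present."""
    present = set(concepts)
    return {c for c in concepts if EXPLAINED_BY.get(c, set()) & present}


def filter_concepts(concepts, min_support=0.5):
    """Step (e): drop low-support concepts and explained symptoms.
    The fixed cutoff is an assumed stand-in for 'relatively lower' support."""
    support = support_levels(concepts)
    explained = explained_symptoms(concepts)
    return [
        c for c in concepts
        if support[c] >= min_support and c not in explained
    ]


extracted = ["heart failure", "dyspnea", "edema", "furosemide", "fracture"]
print(filter_concepts(extracted))  # ['heart failure', 'furosemide']
```

In this toy record, "fracture" is excluded because no other extracted concept supports it, and "dyspnea" and "edema" are excluded because the co-occurring "heart failure" already explains them.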